8 research outputs found

    Nominalization and Alternations in Biomedical Language

    Background: This paper presents data on alternations in the argument structure of common domain-specific verbs and their associated verbal nominalizations in the PennBioIE corpus. Alternation is the term in theoretical linguistics for variations in the surface syntactic form of verbs, e.g. the different forms of "stimulate" in "FSH stimulates follicular development" and "follicular development is stimulated by FSH". The data is used to assess the implications of alternations for biomedical text mining systems and to test the fit of the sublanguage model to biomedical texts. Methodology/Principal Findings: We examined 1,872 tokens of the ten most common domain-specific verbs or their zero-related nouns in the PennBioIE corpus and labelled them for the presence or absence of three alternations. We then annotated the arguments of 746 tokens of the nominalizations related to these verbs and counted alternations related to the presence or absence of arguments and to the syntactic position of non-absent arguments. We found that alternations are quite common both for verbs and for nominalizations. We also found a previously undescribed alternation involving an adjectival present participle. Conclusions/Significance: We found that even in this semantically restricted domain, alternations are quite common, and alternations involving nominalizations are exceptionally diverse. Nonetheless, the sublanguage model applies to biomedica

    Exploring Semantic roles for Named Entity Recognition in the Molecular biology domain

    Named entity recognition (NER) in the molecular biology domain, the task of identifying and categorizing molecular entities appearing in text, is one of the most important tasks in a biological text mining engine. In general, this task is taken as the first step towards the more ambitious task of molecular event extraction (relation extraction) and, eventually, pathway discovery. However, NER in this scientific domain, which seems to be the easiest of the text mining tasks, still achieves quite low performance. As can be seen from the most recent shared-task evaluation of NER in this domain (JNLPBA-2004), the best performance in terms of F1-score is only 72.6. This result is far below what is achieved by NER systems in the newswire domain (F1-score of about 96%), which is near the human level of performance. At present, most NER systems employ term-internal features (e.g., lexical and morphological) and co-occurrence information as term-external features. The lack of a molecular naming convention leads to terminological variation as well as polysemy (i.e. the sharing of names between different entities), so such features are insufficient to handle the difficulties of NER in the molecular biology domain. Obtaining a complete set of rules for the lexical patterns of molecular names seems impossible, so term-external features other than co-occurrence information are of interest.  In this thesis, the semantic relationships between a predicate and its arguments, expressed as semantic roles, are proposed to enhance NER systems in the molecular biology domain. The semantic role information is derived from a predicate-argument structure (PAS), which is a higher level of sentence representation than the syntactic-relation and surface-form levels. Thus, the use of semantic roles is more consistent than co-occurrence information derived from the surface level. 
To employ semantic roles in an NER system, they are realized as various sets of syntactic features, which were used by a machine learning model to explore the most effective way for this knowledge to improve NER.  As a result, the best feature set is composed of 6 lexical features (i.e., surface word, lemma form, orthographic feature, part-of-speech, phrase-chunk and head word of NP-chunk) and 4 PAS-related features representing an argument's semantic role (i.e., predicate's surface form, predicate's lemma, voice and the united feature of subject-object head's lemma and transitive-intransitive sense). Moreover, the use of semantic roles shows positive effects only for predicates that meet the following criteria. A predicate must have both its agent and theme arguments with a higher probability of belonging to a named entity class than to the non-named entity class; otherwise, a predicate must have both its agent and theme arguments with a lower probability of belonging to a named entity class than to the non-named entity class, and the number of training examples for this predicate should be large enough (empirically, at least 270 sentences). The improvement in performance obtained from the NER system using PAS-related features, compared to not using these features, confirms that using semantic roles can enhance NER system
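
As a rough illustration of the feature set described in this abstract, the sketch below assembles the 6 lexical features and 4 PAS-related features for one token. It is not the thesis's implementation: the class names, field names, feature keys, the shape function, and the way the "united" subject-object/transitivity feature is joined into one string are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Token:
    """Hypothetical token record holding pre-computed lexical annotations."""
    surface: str
    lemma: str
    pos: str       # part-of-speech tag
    chunk: str     # phrase-chunk tag, e.g. "B-NP"
    np_head: str   # head word of the enclosing NP chunk


@dataclass
class PredicateContext:
    """Hypothetical record for the predicate whose argument contains the token."""
    surface: str
    lemma: str
    voice: str             # "active" or "passive"
    subj_head_lemma: str
    obj_head_lemma: str
    transitive: bool


def orthographic_shape(word: str) -> str:
    """Map each character to a coarse class (assumed orthographic feature)."""
    shape = []
    for ch in word:
        if ch.isupper():
            shape.append("A")
        elif ch.islower():
            shape.append("a")
        elif ch.isdigit():
            shape.append("0")
        else:
            shape.append(ch)
    return "".join(shape)


def token_features(tok: Token, pred: Optional[PredicateContext] = None) -> Dict[str, str]:
    """Return the 6 lexical features, plus 4 PAS-related features if a predicate is given."""
    feats = {
        "word": tok.surface,
        "lemma": tok.lemma,
        "shape": orthographic_shape(tok.surface),
        "pos": tok.pos,
        "chunk": tok.chunk,
        "np_head": tok.np_head,
    }
    if pred is not None:
        sense = "trans" if pred.transitive else "intrans"
        feats.update({
            "pred_word": pred.surface,
            "pred_lemma": pred.lemma,
            "pred_voice": pred.voice,
            # Assumed encoding of the united subject-object/transitivity feature.
            "pred_args": f"{pred.subj_head_lemma}|{pred.obj_head_lemma}|{sense}",
        })
    return feats
```

A dictionary of this kind could be passed to any sequence labeller that accepts string-valued features; tokens outside a predicate's arguments simply carry the 6 lexical features.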

    Predicate-argument structure for , belonging to group B – , , is shown as Frame 3

    Copyright information: Taken from "PASBio: predicate-argument structures for event extraction in molecular biology", BMC Bioinformatics 2004;5:155. Published online 19 Oct 2004. PMCID: PMC535924. Copyright © 2004 Wattarujeekrit et al; licensee BioMed Central Ltd. Though is used to mean in both biological corpus and business news corpus, set of arguments are not the same. Use of is illustrated here. Similar to predicate , PASBio's predicate-argument structure of has less arguments than in PropBank [22, 23] as shown in Frame 4

    Sentences (1)–(3), three different sentences using predicate taken from MEDLINE 36 and EMBO 37 Journal articles, are given as examples to illustrate the variation of the language usage in biological articles

    Copyright information: Taken from "PASBio: predicate-argument structures for event extraction in molecular biology", BMC Bioinformatics 2004;5:155. Published online 19 Oct 2004. PMCID: PMC535924. Copyright © 2004 Wattarujeekrit et al; licensee BioMed Central Ltd. To convey the information marked as [...] or [...] or [...] can be written in various forms as discussed in the main text. Similarly, the variation of surface linguistic expressions can also be seen from sentences (4)–(6) conveying event . Sentence (6) is an example to show that the domain knowledge is really necessary for correct understanding

    Genome Informatics 14: 677-678 (2003). Open Ontology Forge: A Tool for Ontology Creation

    In this paper, we will introduce Open Ontology Forge (OOF), a software tool for ontology creation, terminology annotation, and coreference annotation by experts applied to the biomedical domain. Encoding expert's knowledge of this domain in a consistent and machine-understandable way is important in order to make the knowledge publicly available and improve the quality of information extraction from the vast and growing amount of texts such as online journals. OOF provides a convenient environment for knowledge-encoding by biomedical experts, reducing much effort and maintaining the consistency of encoding at the same tim